This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

Write an R function named explore that takes a data frame, a vector of bin sizes, and a correlation threshold as input parameters:

1) Plot a pair of blue histograms with a vertical red line at the mean (one using counts and the other density) for every numerical variable at each bin size specified in the bin sizes input parameter. You can plot individually or as a grid. If you choose to plot as a grid, there should be separate grids for each count-bin size combination and separate grids for each density-bin size combination. For example, 5 numeric variables and a vector of three bin sizes will generate 30 individual plots or a total of 6 grid plots (with each grid plot containing 5 subplots).
2) Plot a gray bar graph for every categorical and binary variable.
3) Calculate the r2 (r-square) value for every pair of numerical variables.
4) Return the following in an R list:
    a. A frequency table for every categorical and binary variable
    b. For numerical variables
        i. A summary statistics table for each numerical variable
        ii. A data frame that contains each pair of variable names and the associated r-square value.
        iii. A data frame that contains each pair of variable names and correlation coefficient (Pearson) for all coefficients whose absolute value is greater than the correlation threshold (do not repeat any pairs)
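The plot counts in the example come from multiplying the number of numeric variables, the number of bin sizes, and the two histogram types. A minimal sketch of that arithmetic; the names n_numeric, bin_sizes and hist_types are hypothetical and only used for illustration:

```r
n_numeric <- 5             # hypothetical: number of numeric variables
bin_sizes <- c(5, 20, 50)  # a vector of three bin sizes
hist_types <- 2            # one histogram using counts, one using density

# one plot per variable, bin size and histogram type
individual_plots <- n_numeric * length(bin_sizes) * hist_types
# one grid per bin size and histogram type, each holding n_numeric subplots
grid_plots <- length(bin_sizes) * hist_types

individual_plots # 30
grid_plots       # 6
```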

explore<-function(dataframe,binsize,cor_threshold){ #Define the explore function with input data dataframe, binsize and cor_threshold
  require(grid) #load grid package for plots
  require(ggplot2) #load ggplot2 package for plots
  #Question 1: we first find the numeric variables and then use for loops to draw the histograms with ggplot()
  nums<-dataframe[sapply(dataframe,is.numeric)] #use sapply() to find all numeric columns and store them in nums
  histlist<-list() #create a variable histlist to hold the current histogram; it starts out empty
  for (i in seq_along(binsize)){ #go through binsize from the first value to the last
    for (j in seq_len(ncol(nums))){ #go through nums from the first column to the last
      binw<-(max(nums[,j])-min(nums[,j]))/binsize[i] #calculate the binwidth as the range of the jth column divided by the requested number of bins
      histlist<-ggplot(nums,aes(x=nums[,j]),environment=environment()) #start a plot with the jth column as the x aesthetic; the environment parameter lets ggplot find nums and j inside the function
      histlist<-histlist+geom_histogram(colour="blue",fill="blue",binwidth=binw)+labs(x=colnames(nums)[j])+geom_vline(xintercept=mean(nums[,j]),colour="red") #add a blue histogram with the calculated binwidth and label the x axis (the y axis is labelled count automatically), then draw a vertical red line at the mean
      print(histlist) #output the histogram using counts
      print(histlist+aes(y=..density..)+labs(y="density")+geom_density()) #output the histogram using density, with a density curve added, and label the y axis
    } #finish the inner loop
  } #finish the outer loop
  #Question 2: we first find the factor variables and the binary (0/1 or logical) variables. Then we put them into one data frame, use a for loop to draw a bar graph for each and store the graphs in a list.
  factors<-dataframe[sapply(dataframe,is.factor)] #use sapply() to find all factor columns and store them in factors
  binarys<-data.frame(matrix(ncol=0, nrow=nrow(dataframe))) #create an empty data frame for the binary columns with the same number of rows as dataframe
  a=1 #a is the position of the next column to be written into binarys
  for (i in seq_len(ncol(dataframe))){ #go through dataframe from the first column to the last
    if (sum(dataframe[,i]==1)+sum(dataframe[,i]==0)==nrow(dataframe)){ #check whether the column contains only 0s and 1s (TRUE and FALSE in a logical column also count as 1 and 0)
      binarys<-data.frame(binarys,dataframe[,i]) #append the binary column to binarys
      names(binarys)[a]=colnames(dataframe)[i] #keep the original column name in the new data frame
      a=a+1 #add 1 to a to move to the next column in binarys
    } #finish if
  } #finish for loop
  fnb<-data.frame(factors,binarys) #create a data frame fnb that holds the factor and binary columns
  plotlist<-list() #create an empty list plotlist to hold all bar graphs
  for(i in seq_len(ncol(fnb))){ #go through all variables in fnb
    plotlist[[i]]<-ggplot(fnb,aes_string(x=colnames(fnb)[i]),environment=environment())+geom_bar(colour="gray",fill="grey")+ggtitle(paste(colnames(fnb)[i],"distribution")) #store a gray bar graph of the ith column of fnb in plotlist[[i]], with the column name as the x label and in the title, again using the environment parameter
  } #finish for loop
  print(plotlist) #output plotlist
  #Question 3: to calculate the r-square value between two variables, we fit a linear regression between them with lm(). We use for loops to calculate the r-square value of every pair and store the results, which also gives us the data frame for 4bii.
  Pair_of_variables<-c() #create a variable Pair_of_variables to hold the pairs of variable names
  rsquared<-c() #create a variable rsquared to hold the r-square values
  n=1 #n is the current position in Pair_of_variables and rsquared, starting at 1
  for (i in seq_len(max(ncol(nums)-1,0))){ #go through nums from the first column to the second-to-last
    for (j in (i+1):ncol(nums)){ #go through nums from column i+1 to the last, so that no pair is repeated
      Pair_of_variables[n]<-paste(colnames(nums)[i],"-",colnames(nums)[j],sep="") #use paste() to write the pair of variable names as a single string separated by a -
      rsquared[n]<-summary(lm(nums[,i] ~ nums[,j]))$r.squared #use summary() and lm() to get the r-square value between the two variables
      n=n+1 #add 1 to n to move to the next position in Pair_of_variables and rsquared
    } #finish inner for loop
  } #finish outer for loop
  #Question 4bii
  newdata<-data.frame(Pair_of_variables, rsquared) #create a data frame newdata from Pair_of_variables and rsquared
  print(newdata) #output newdata
  #Question 4a: for this question, we will first create a variable to store tables. Then use for loop to create frequency tables for every categorical and binary variables. Because mtcars doesn't have factors and 
  tablelist<-list() #create a variable tablelist for a list of tables, make it empty, we will use it to put all tables
  for (i in seq_len(ncol(fnb))){ #go through all categorical and binary variables in fnb
    tablelist[[i]]<-as.data.frame(table(fnb[,i])) #use table() to count the values of the ith column in fnb, convert the result to a data frame and store it in tablelist[[i]]
    names(tablelist[[i]])[1]=colnames(fnb)[i] #use names() to keep the variable name as the name of the first column
  } #finish for loop
  print(tablelist) #output tablelist
  #Question 4bi: we use a for loop to create the summary statistics table of every numerical column.
  sumtable<-list() #create an empty list sumtable to hold the summary tables
  for (i in seq_len(ncol(nums))){ #for each numeric column in the data frame
    sumtable[[i]] <- summary(nums[,i]) #store the summary table of the ith column as the ith element of sumtable
  } #finish for loop
  print(sumtable) #output sumtable
  #Question 4biii: we use for loops to fill two variables, one with the pairs of variable names and the other with the corresponding Pearson correlation coefficients. We use if() to select the coefficients that are greater than cor_threshold and then put the two variables into a data frame.
  Pairofvariables<-c() #create a variable Pairofvariables to hold the pairs of variable names
  Pearson_cor_coeff<-c() #create a variable Pearson_cor_coeff to hold the corresponding Pearson correlation coefficients
  n=1 #n is the current position in Pairofvariables and Pearson_cor_coeff, starting at 1
  for (i in seq_len(max(ncol(nums)-1,0))){ #go through nums from the first column to the second-to-last
    for (j in (i+1):ncol(nums)){ #go through nums from column i+1 to the last, so that no pair is repeated
      r<-cor(nums[ ,i],nums[ ,j],method="pearson") #use cor() to calculate the Pearson correlation coefficient between the ith and jth columns of nums
      if(r>cor_threshold){ #check whether the coefficient is larger than cor_threshold (a signed comparison, so negative coefficients are never selected)
        Pairofvariables[n]  <- paste(colnames(nums)[i],"-",colnames(nums)[j],sep="") #use paste() to write the pair of variable names as a single string separated by a -
        Pearson_cor_coeff[n]  <- r #store the coefficient in Pearson_cor_coeff
        n=n+1 #add 1 to n to move to the next position in Pairofvariables and Pearson_cor_coeff
      } #finish if
    } #finish inner for loop
  } #finish outer for loop
  perdata<-data.frame(Pairofvariables, Pearson_cor_coeff) #create a data frame perdata from Pairofvariables and Pearson_cor_coeff
  print(perdata) #output perdata
}
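Item 4 of the task asks for the results to be returned in an R list, whereas explore() as written prints each result. One way the pieces could be collected is sketched below. The helper name collect_results and the list element names are assumptions made for illustration, not part of the original function, and only factor columns are treated as categorical to keep the sketch short.

```r
# Hypothetical helper: gathers frequency and summary tables into one list
collect_results <- function(dataframe) {
  nums <- dataframe[sapply(dataframe, is.numeric)]   # numeric columns
  factors <- dataframe[sapply(dataframe, is.factor)] # factor columns
  list(
    # one frequency table per factor column
    frequency_tables = lapply(factors, function(col) as.data.frame(table(col))),
    # one summary statistics table per numeric column
    summary_tables = lapply(nums, summary)
  )
}

res <- collect_results(iris) # iris ships with R in the datasets package
names(res)
res$frequency_tables$Species
```

Returning a named list lets the caller pick out a single table, for example res$summary_tables$Sepal.Length, instead of reading it from printed output.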
  1. Test your function using the diamonds data frame you extended to include the VS logical column, a vector of bin sizes (5, 20, 50), and a correlation threshold of 0.25. Also test your function using the mtcars data.
require(ggplot2) #load ggplot2 package to get diamonds data frame
## Loading required package: ggplot2
require(datasets) #mtcars is in package datasets, just make sure we have mtcars
data(diamonds) #load diamonds data frame
ratioT=length(mtcars$vs[mtcars$vs==1])/length(mtcars$vs) #calculate the proportion of 1s in mtcars$vs and store it in ratioT
trail<-rbinom(nrow(diamonds),1,ratioT) #use rbinom() to draw one random 0 or 1 per row of diamonds, with probability ratioT of drawing a 1
logicalcol<-trail==1 #convert the draws to a logical vector: TRUE where trail is 1 and FALSE otherwise
newdiamonds<-data.frame(diamonds,logicalcol) #create a new data frame called newdiamonds and put diamonds and logicalcol into it
explore(newdiamonds, c(5,20,50), 0.25) #test explore() by using newdiamonds
## Loading required package: grid

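## [[1]]
[Figure: gray bar chart titled "cut distribution"]
## 
## [[2]]
[Figure: gray bar chart titled "color distribution"]
## 
## [[3]]
[Figure: gray bar chart titled "clarity distribution"]
## 
## [[4]]
[Figure: gray bar chart titled "logicalcol distribution"]
## 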
##    Pair_of_variables     rsquared
## 1        carat-depth 0.0007966119
## 2        carat-table 0.0329849332
## 3        carat-price 0.8493305264
## 4            carat-x 0.9508087510
## 5            carat-y 0.9057751441
## 6            carat-z 0.9089474974
## 7        depth-table 0.0874849338
## 8        depth-price 0.0001133672
## 9            depth-x 0.0006395460
## 10           depth-y 0.0008608750
## 11           depth-z 0.0090105434
## 12       table-price 0.0161630291
## 13           table-x 0.0381593881
## 14           table-y 0.0337677917
## 15           table-z 0.0227794699
## 16           price-x 0.7822255540
## 17           price-y 0.7489533305
## 18           price-z 0.7417506045
## 19               x-y 0.9500429745
## 20               x-z 0.9423978849
## 21               y-z 0.9063148836
## [[1]]
##         cut  Freq
## 1      Fair  1610
## 2      Good  4906
## 3 Very Good 12082
## 4   Premium 13791
## 5     Ideal 21551
## 
## [[2]]
##   color  Freq
## 1     D  6775
## 2     E  9797
## 3     F  9542
## 4     G 11292
## 5     H  8304
## 6     I  5422
## 7     J  2808
## 
## [[3]]
##   clarity  Freq
## 1      I1   741
## 2     SI2  9194
## 3     SI1 13065
## 4     VS2 12258
## 5     VS1  8171
## 6    VVS2  5066
## 7    VVS1  3655
## 8      IF  1790
## 
## [[4]]
##   logicalcol  Freq
## 1      FALSE 30487
## 2       TRUE 23453
## 
## [[1]]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2000  0.4000  0.7000  0.7979  1.0400  5.0100 
## 
## [[2]]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   43.00   61.00   61.80   61.75   62.50   79.00 
## 
## [[3]]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   43.00   56.00   57.00   57.46   59.00   95.00 
## 
## [[4]]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     326     950    2401    3933    5324   18820 
## 
## [[5]]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   4.710   5.700   5.731   6.540  10.740 
## 
## [[6]]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   4.720   5.710   5.735   6.540  58.900 
## 
## [[7]]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.910   3.530   3.539   4.040  31.800 
## 
##    Pairofvariables Pearson_cor_coeff
## 1      carat-price         0.9215913
## 2          carat-x         0.9750942
## 3          carat-y         0.9517222
## 4          carat-z         0.9533874
## 5          price-x         0.8844352
## 6          price-y         0.8654209
## 7          price-z         0.8612494
## 8              x-y         0.9747015
## 9              x-z         0.9707718
## 10             y-z         0.9520057
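For a regression with a single predictor, the r-square value equals the square of the Pearson correlation coefficient, which links the two tables above: for carat-price, 0.9215913 squared is about 0.8493305. A quick check of that relationship, sketched on mtcars so that it runs without ggplot2:

```r
# r-square from a one-predictor regression, computed the same way as in explore()
r2_lm <- summary(lm(mtcars$mpg ~ mtcars$wt))$r.squared
# r-square as the squared Pearson correlation coefficient
r2_cor <- cor(mtcars$mpg, mtcars$wt, method = "pearson")^2

all.equal(r2_lm, r2_cor) # TRUE
```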
explore(mtcars,c(5,20,50),0.25) #test explore() by using mtcars

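## [[1]]
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
[Figure: gray bar chart titled "vs distribution"]
## 
## [[2]]
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
[Figure: gray bar chart titled "am distribution"]
## 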
##    Pair_of_variables    rsquared
## 1            mpg-cyl 0.726180005
## 2           mpg-disp 0.718343340
## 3             mpg-hp 0.602437341
## 4           mpg-drat 0.463995168
## 5             mpg-wt 0.752832794
## 6           mpg-qsec 0.175296320
## 7             mpg-vs 0.440947686
## 8             mpg-am 0.359798943
## 9           mpg-gear 0.230673448
## 10          mpg-carb 0.303518437
## 11          cyl-disp 0.813663302
## 12            cyl-hp 0.692968762
## 13          cyl-drat 0.489913363
## 14            cyl-wt 0.612299668
## 15          cyl-qsec 0.349567190
## 16            cyl-vs 0.657415769
## 17            cyl-am 0.273118125
## 18          cyl-gear 0.242740085
## 19          cyl-carb 0.277716662
## 20           disp-hp 0.625599666
## 21         disp-drat 0.504403822
## 22           disp-wt 0.788508342
## 23         disp-qsec 0.188093852
## 24           disp-vs 0.504690738
## 25           disp-am 0.349549413
## 26         disp-gear 0.308657134
## 27         disp-carb 0.156006724
## 28           hp-drat 0.201384745
## 29             hp-wt 0.433948779
## 30           hp-qsec 0.501580369
## 31             hp-vs 0.522868892
## 32             hp-am 0.059148311
## 33           hp-gear 0.015801561
## 34           hp-carb 0.562218742
## 35           drat-wt 0.507571675
## 36         drat-qsec 0.008318308
## 37           drat-vs 0.193845127
## 38           drat-am 0.507957151
## 39         drat-gear 0.489454337
## 40         drat-carb 0.008242788
## 41           wt-qsec 0.030525638
## 42             wt-vs 0.307931409
## 43             wt-am 0.479549684
## 44           wt-gear 0.340223720
## 45           wt-carb 0.182846838
## 46           qsec-vs 0.554333027
## 47           qsec-am 0.052836016
## 48         qsec-gear 0.045233731
## 49         qsec-carb 0.430663050
## 50             vs-am 0.028340081
## 51           vs-gear 0.042445620
## 52           vs-carb 0.324452295
## 53           am-gear 0.630529315
## 54           am-carb 0.003310202
## 55         gear-carb 0.075115920
## [[1]]
##   vs Freq
## 1  0   18
## 2  1   14
## 
## [[2]]
##   am Freq
## 1  0   19
## 2  1   13
## 
## [[1]]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   15.42   19.20   20.09   22.80   33.90 
## 
## [[2]]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.000   4.000   6.000   6.188   8.000   8.000 
## 
## [[3]]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    71.1   120.8   196.3   230.7   326.0   472.0 
## 
## [[4]]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    52.0    96.5   123.0   146.7   180.0   335.0 
## 
## [[5]]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.760   3.080   3.695   3.597   3.920   4.930 
## 
## [[6]]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.513   2.581   3.325   3.217   3.610   5.424 
## 
## [[7]]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   14.50   16.89   17.71   17.85   18.90   22.90 
## 
## [[8]]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.4375  1.0000  1.0000 
## 
## [[9]]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.4062  1.0000  1.0000 
## 
## [[10]]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   3.000   4.000   3.688   4.000   5.000 
## 
## [[11]]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   2.000   2.812   4.000   8.000 
## 
##    Pairofvariables Pearson_cor_coeff
## 1         mpg-drat         0.6811719
## 2         mpg-qsec         0.4186840
## 3           mpg-vs         0.6640389
## 4           mpg-am         0.5998324
## 5         mpg-gear         0.4802848
## 6         cyl-disp         0.9020329
## 7           cyl-hp         0.8324475
## 8           cyl-wt         0.7824958
## 9         cyl-carb         0.5269883
## 10         disp-hp         0.7909486
## 11         disp-wt         0.8879799
## 12       disp-carb         0.3949769
## 13           hp-wt         0.6587479
## 14         hp-carb         0.7498125
## 15         drat-vs         0.4402785
## 16         drat-am         0.7127111
## 17       drat-gear         0.6996101
## 18         wt-carb         0.4276059
## 19         qsec-vs         0.7445354
## 20         am-gear         0.7940588
## 21       gear-carb         0.2740728
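The task statement asks for coefficients whose absolute value exceeds the threshold, while the comparison inside explore() is signed. Pairs with a strong negative correlation, such as mpg-cyl (r-square 0.726 in the earlier table), therefore do not appear in this table. A minimal sketch of an absolute-value filter follows; the function name strong_pairs is hypothetical, and it works from the correlation matrix rather than the nested loops used in explore().

```r
# Hypothetical helper: pairs whose |Pearson r| exceeds the threshold, each pair listed once
strong_pairs <- function(dataframe, cor_threshold) {
  nums <- dataframe[sapply(dataframe, is.numeric)] # numeric columns only
  cors <- cor(nums, method = "pearson")            # full correlation matrix
  # upper.tri() keeps each pair once and drops the diagonal
  keep <- upper.tri(cors) & abs(cors) > cor_threshold
  idx <- which(keep, arr.ind = TRUE)               # row and column index of each kept pair
  data.frame(
    Pairofvariables = paste(rownames(cors)[idx[, "row"]],
                            colnames(cors)[idx[, "col"]], sep = "-"),
    Pearson_cor_coeff = cors[idx]
  )
}

abs_pairs <- strong_pairs(mtcars, 0.25)
head(abs_pairs)
```

Because the pairs are read from the matrix column by column, they come out in a different order from the loop-based table, but each pair still appears only once.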